University of Texas at San Antonio



**Open Cloud Institute**


Machine Learning/Big Data EE-6973-001-Fall-2016


**Paul Rad, Ph.D.**

**Ali Miraftab, Research Fellow**



**Context-Dependent Speech Recognition**


Andrew Boles, Milad Mostavi
*University of Texas at San Antonio, San Antonio, Texas, USA*
ckj771@my.utsa.edu, m.mostavi@gmail.com



**Project Definition:** Using deep neural network models, we study human voice recognition. The training set contains 74 different speakers, with a varying number of spoken sentences per speaker; the dataset also includes 10 additional speakers for testing.

The data is given in the raw audio (.raw) format: uncompressed, headerless PCM samples, comparable in file size to WAV. Using convolution and other techniques, we will identify the speaker from his or her speech.
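
As a minimal sketch, a headerless .raw file can be read directly with NumPy. The 16 kHz, 16-bit signed little-endian, mono format assumed below is the common convention for this corpus, but it is an assumption rather than something stated in the dataset description.

```python
# Minimal sketch: reading a headerless .raw PCM file with NumPy.
# Assumption: 16-bit signed little-endian mono samples at 16 kHz.
import numpy as np

def load_raw(path, dtype=np.dtype('<i2')):
    """Read raw PCM samples and scale them to floats in [-1, 1]."""
    samples = np.fromfile(path, dtype=dtype)
    return samples.astype(np.float32) / np.iinfo(dtype).max
```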

We hope that the deep learning pipeline developed on this dataset can be extended to classify an individual given some training data. The goal is to identify the speaker of an input audio file, eventually in real time.

**Outcome:** A model that applies various convolutional methods to recognize a speaker from his or her voice.
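
As one possible sketch of such a model, the following 1-D convolutional classifier operates directly on fixed-length raw-waveform clips. Keras, the one-second clip length, and all layer sizes are illustrative assumptions, not the project's final design.

```python
# Sketch of a 1-D CNN speaker classifier over raw waveforms (assumed setup).
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dense

NUM_SPEAKERS = 74   # training speakers in the dataset
CLIP_LEN = 16000    # assumption: one-second clips at 16 kHz

model = Sequential([
    # Wide first filter with a large stride to downsample the raw waveform early.
    Conv1D(32, kernel_size=64, strides=8, activation='relu',
           input_shape=(CLIP_LEN, 1)),
    MaxPooling1D(pool_size=4),
    Conv1D(64, kernel_size=32, strides=2, activation='relu'),
    GlobalAveragePooling1D(),
    Dense(NUM_SPEAKERS, activation='softmax'),  # one class per speaker
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

Training would then call `model.fit` on batches of clips labeled with integer speaker IDs.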

**Dataset:** The speech data (the CMU AN4 database) is available at [http://www.speech.cs.cmu.edu/databases/an4/](http://www.speech.cs.cmu.edu/databases/an4/). The directory contains 74 subdirectories, one per speaker and named by user ID, each holding the sentences spoken by that person. A separate test directory contains 10 different speakers.
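
Given that layout, the corpus can be indexed into (file, speaker) pairs with a short directory walk. The sketch below assumes the download is unpacked at a placeholder root path and that utterances carry a .raw extension; the helper name is hypothetical.

```python
# Sketch: build (path, speaker_id) pairs from the per-speaker directory layout.
import os

def index_corpus(root):
    """Return (file path, speaker id) pairs for every .raw file under root."""
    pairs = []
    for speaker in sorted(os.listdir(root)):
        speaker_dir = os.path.join(root, speaker)
        if not os.path.isdir(speaker_dir):
            continue  # skip stray files at the top level
        for name in sorted(os.listdir(speaker_dir)):
            if name.endswith('.raw'):
                pairs.append((os.path.join(speaker_dir, name), speaker))
    return pairs
```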